Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
Abstract
Proof. For the base case t = H+1, since $V_{DR}^{0} = V(s_{H+1}) = 0$, it is obvious that at the (H+1)-th step the estimator is unbiased with 0 variance, and the theorem holds. For the inductive step, suppose the theorem holds for step t+1. At time step t, we have (the first equality uses the unbiasedness of the estimator, $\mathbb{E}_t[V_{DR}^{H+1-t}] = \mathbb{E}_t[V(s_t)]$, and $\Delta(s,a) := \hat Q(s,a) - Q(s,a)$):
$$
\begin{aligned}
\mathbb{V}_t\big[V_{DR}^{H+1-t}\big]
&= \mathbb{E}_t\Big[\big(V_{DR}^{H+1-t}\big)^2\Big] - \big(\mathbb{E}_t[V(s_t)]\big)^2 \\
&= \mathbb{E}_t\Big[\big(\hat V(s_t) + \rho_t\big(r_t + \gamma V_{DR}^{H-t} - \hat Q(s_t,a_t)\big)\big)^2 - V(s_t)^2\Big] + \mathbb{V}_t\big[V(s_t)\big] \\
&= \mathbb{E}_t\Big[\big(\rho_t Q(s_t,a_t) - \rho_t Q(s_t,a_t) + \hat V(s_t) + \rho_t\big(r_t + \gamma V_{DR}^{H-t} - \hat Q(s_t,a_t)\big)\big)^2 - V(s_t)^2\Big] + \mathbb{V}_t\big[V(s_t)\big] \\
&= \mathbb{E}_t\Big[\big(-\rho_t \Delta(s_t,a_t) + \hat V(s_t) + \rho_t\big(r_t - R(s_t,a_t)\big) + \rho_t\gamma\big(V_{DR}^{H-t} - \mathbb{E}_{t+1}[V(s_{t+1})]\big)\big)^2 - V(s_t)^2\Big] + \mathbb{V}_t\big[V(s_t)\big] \quad (15) \\
&= \mathbb{E}_t\Big[\mathbb{E}_t\big[\big(-\rho_t \Delta(s_t,a_t) + \hat V(s_t)\big)^2 - V(s_t)^2 \,\big|\, s_t\big]\Big]
  + \mathbb{E}_t\Big[\mathbb{E}_{t+1}\big[\rho_t^2\big(r_t - R(s_t,a_t)\big)^2\big]\Big]
  + \mathbb{E}_t\Big[\mathbb{E}_{t+1}\big[\rho_t^2\gamma^2\big(V_{DR}^{H-t} - \mathbb{E}_{t+1}[V(s_{t+1})]\big)^2\big]\Big]
  + \mathbb{V}_t\big[V(s_t)\big] \\
&= \mathbb{E}_t\Big[\mathbb{V}_t\big[-\rho_t \Delta(s_t,a_t) + \hat V(s_t) \,\big|\, s_t\big]\Big]
  + \mathbb{E}_t\big[\rho_t^2\,\mathbb{V}_{t+1}[r_t]\big]
  + \mathbb{E}_t\big[\rho_t^2\gamma^2\,\mathbb{V}\big[V_{DR}^{H-t} \,\big|\, s_t,a_t\big]\big]
  + \mathbb{V}_t\big[V(s_t)\big] \\
&= \mathbb{E}_t\Big[\mathbb{V}_t\big[\rho_t \Delta(s_t,a_t) \,\big|\, s_t\big]\Big]
  + \mathbb{E}_t\big[\rho_t^2\,\mathbb{V}_{t+1}[r_t]\big]
  + \mathbb{E}_t\big[\rho_t^2\gamma^2\,\mathbb{V}_{t+1}\big[V_{DR}^{H-t}\big]\big]
  + \mathbb{V}_t\big[V(s_t)\big].
\end{aligned}
$$
This completes the proof. Note that from Eqn.(15) to the next step, we have used the fact that conditioned on $s_t$ and $a_t$, …
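To make the recursion that the derivation expands concrete, here is a minimal numerical sketch, assuming a hypothetical toy MDP: it implements the step-wise doubly robust estimator $V_{DR}^{H+1-t} = \hat V(s_t) + \rho_t\,(r_t + \gamma V_{DR}^{H-t} - \hat Q(s_t,a_t))$ with $V_{DR}^{0}=0$ and runs a rough Monte Carlo check of the unbiasedness used in the first equality above. The MDP, its numbers, and the helper names (true_q_v, dr_estimate) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP (2 states, 2 actions, horizon H); all numbers are
# made up for illustration only.
H, gamma = 4, 1.0
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s']
              [[0.6, 0.4], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                    # R[s, a] = E[r_t | s_t = s, a_t = a]
              [0.2, 0.8]])
mu = np.array([[0.5, 0.5], [0.5, 0.5]])      # behavior policy mu(a | s)
pi = np.array([[0.9, 0.1], [0.2, 0.8]])      # target policy  pi(a | s)

def true_q_v(policy):
    """Finite-horizon Q and V of `policy` by backward induction (t = 1..H)."""
    Q = np.zeros((H + 2, 2, 2))
    V = np.zeros((H + 2, 2))
    for t in range(H, 0, -1):
        Q[t] = R + gamma * (P @ V[t + 1])
        V[t] = (policy * Q[t]).sum(axis=1)
    return Q, V

Q_pi, V_pi = true_q_v(pi)

# A deliberately inaccurate model (hat Q); hat V(s) = sum_a pi(a|s) hat Q(s, a).
Q_hat = Q_pi + rng.normal(0.0, 0.3, size=Q_pi.shape)
V_hat = (pi[None] * Q_hat).sum(axis=2)

def dr_estimate(s0, Q_model, V_model):
    """Sample one trajectory under mu, then apply the step-wise DR recursion
    V_DR^{H+1-t} = V_model(s_t) + rho_t (r_t + gamma V_DR^{H-t} - Q_model(s_t, a_t))."""
    traj, s = [], s0
    for t in range(1, H + 1):
        a = rng.choice(2, p=mu[s])
        r = R[s, a] + rng.normal(0.0, 0.1)   # noisy reward with mean R[s, a]
        traj.append((t, s, a, r))
        s = rng.choice(2, p=P[s, a])
    v_dr = 0.0                                # base case: V_DR^0 = 0
    for t, s, a, r in reversed(traj):
        rho = pi[s, a] / mu[s, a]
        v_dr = V_model[t, s] + rho * (r + gamma * v_dr - Q_model[t, s, a])
    return v_dr

est = np.array([dr_estimate(0, Q_hat, V_hat) for _ in range(50_000)])
print("true V^pi at (t=1, s=0):", V_pi[1, 0])
print("mean of DR estimates:   ", est.mean())   # should agree up to Monte Carlo error
```

In this sketch $\hat V$ is derived from $\hat Q$ under the target policy, $\hat V(s)=\sum_a \pi(a|s)\,\hat Q(s,a)$; that choice is what makes $\mathbb{E}[\hat V(s_t) - \rho_t\Delta(s_t,a_t)\,|\,s_t] = V(s_t)$, the fact used when the squared term in Eqn.(15) is turned into a conditional variance.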
Similar Papers
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
We study the problem of off-policy value evaluation in reinforcement learning (RL), where one aims to estimate the value of a new policy based on data collected by a different policy. This problem is often a critical step when applying RL to real-world problems. Despite its importance, existing general methods either have uncontrolled bias or suffer high variance. In this work, we extend the do...
Doubly Robust Off-policy Evaluation for Reinforcement Learning
We study the problem of evaluating a policy that is different from the one that generates data. Such a problem, known as off-policy evaluation in reinforcement learning (RL), is encountered whenever one wants to estimate the value of a new solution, based on historical data, before actually deploying it in the real system, which is a critical step of applying RL in most real-world applications....
More Robust Doubly Robust Off-policy Evaluation
We study the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of a policy from the data generated by another policy(ies). In particular, we focus on the doubly robust (DR) estimators that consist of an importance sampling (IS) component and a performance model, and utilize the low (or zero) bias of IS and low variance of the mo...
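The trade-off described above, the low (or zero) bias of the IS component combined with the low variance of the model, can be seen directly in the hypothetical toy example sketched after the proof: setting the model to zero ($\hat Q \equiv 0$, $\hat V \equiv 0$) reduces the DR recursion to per-decision (step-wise) importance sampling, so comparing the empirical variances of the two isolates what the model term contributes. The snippet below continues that sketch, reusing dr_estimate, Q_hat, and V_hat defined there; it is an illustration, not code from any of the cited papers.

```python
# Continuing the toy example above: with the model zeroed out, the DR recursion
# is exactly per-decision importance sampling (IS). Both are unbiased here, but
# DR typically has lower variance when hat Q is reasonably accurate.
Q_zero, V_zero = np.zeros_like(Q_hat), np.zeros_like(V_hat)

is_est = np.array([dr_estimate(0, Q_zero, V_zero) for _ in range(50_000)])
dr_est = np.array([dr_estimate(0, Q_hat, V_hat) for _ in range(50_000)])

print(f"IS: mean {is_est.mean():.3f}  variance {is_est.var():.3f}")
print(f"DR: mean {dr_est.mean():.3f}  variance {dr_est.var():.3f}")
```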
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have ...
Doubly Robust Policy Evaluation and Learning
We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including health-care policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions an...
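For the one-step contextual-bandit setting described above, the doubly robust estimator adds an importance-weighted correction on the logged action to a direct estimate from a reward model. Below is a minimal sketch of that standard construction, assuming synthetic logged data and made-up names (dr_bandit_value, target_policy, reward_model, true_mean_reward); it is not code or data from the cited paper.

```python
import numpy as np

def dr_bandit_value(x, a_logged, r_logged, mu_logged, pi_probs_fn, r_hat_fn, n_actions):
    """Doubly robust estimate of a target policy's value from logged bandit data.

    x            : contexts, shape (n, d)
    a_logged     : actions chosen by the behavior (logging) policy, shape (n,)
    r_logged     : rewards observed for those actions, shape (n,)
    mu_logged    : behavior propensities mu(a_logged | x), shape (n,)
    pi_probs_fn  : x -> target policy action probabilities, shape (n, n_actions)
    r_hat_fn     : fitted reward model, (x, a) -> predicted rewards, shape (n,)
    """
    n = len(x)
    pi_probs = pi_probs_fn(x)
    # Direct-method term: modeled reward averaged over the target policy's actions.
    dm = sum(pi_probs[:, a] * r_hat_fn(x, np.full(n, a)) for a in range(n_actions))
    # Importance-weighted correction, applied only at the logged action.
    rho = pi_probs[np.arange(n), a_logged] / mu_logged
    return float(np.mean(dm + rho * (r_logged - r_hat_fn(x, a_logged))))

# Hypothetical usage on synthetic logged data (uniform behavior policy).
rng = np.random.default_rng(1)
n, n_actions = 20_000, 2
x = rng.normal(size=(n, 3))

def true_mean_reward(x, a):            # assumed ground truth, used only to simulate logs
    return 1.0 / (1.0 + np.exp(-(x[:, 0] + a)))

a_logged = rng.integers(0, n_actions, size=n)
r_logged = rng.binomial(1, true_mean_reward(x, a_logged)).astype(float)
mu_logged = np.full(n, 1.0 / n_actions)

def target_policy(x):                  # the new policy to be evaluated
    return np.tile([0.2, 0.8], (len(x), 1))

def reward_model(x, a):                # deliberately biased model; the correction fixes it
    return true_mean_reward(x, a) + 0.1

print(dr_bandit_value(x, a_logged, r_logged, mu_logged,
                      target_policy, reward_model, n_actions))
```

The estimate remains consistent if either the logged propensities or the reward model is accurate, which is the "doubly robust" property the title refers to.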
Eligibility Traces for Off-Policy Policy Evaluation
Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the policy that generates the data. Off-policy methods can greatly multiply learning, as many policie...